Abstract

Training and deploying large machine learning (ML) models
is time-consuming and requires significant distributed computing
infrastructures. Based on real-world large model training
on datacenter-scale infrastructures, we show that 14-32% of
all GPU hours are spent on communication with no overlapping
computation. To minimize the outstanding communication
latency, in this work, we develop an agile performance
modeling framework to guide parallelization and hardware-software
co-design strategies. Using the suite of real-world
large ML models on state-of-the-art GPU training hardware,
we demonstrate 2.24x and 5.27x throughput improvement potential
for pre-training and inference scenarios, respectively.


1. Introduction 


Billion-parameter large language models (LLMs) [6, 45, 58, 
59] power applications that have shown far-reaching impact 
across different domains [35, 11, 12, 44]. Similarly, trillion- 
parameter recommendation models [37, 70] have demon- 
strated state-of-the-art user modeling and content understand- 
ing across search [3, 8, 28, 74], social media [1, 16, 17, 69], 
e-commerce [76, 77], and entertainment [18]. As these large 
ML models increase in size and complexity [16, 17], the cor- 
responding training and inference workloads become ever 
more resource-intensive. Identifying better mappings between 
ML workloads and distributed systems can provide signifi- 
cant infrastructure benefits by reducing million-hour training 
times [6, 58, 59] and enabling faster exploration of novel 
model architectures on new hardware systems. 

As large ML models expand beyond single-node platforms, 
successful training and inference solutions have to take into 
account the underlying distributed systems and hardware de- 
vices [10, 25, 26, 27, 41, 39, 40, 42]. In order to leverage 
advancements in compute, memory, and interconnect of data 
center scale distributed systems, developers must consider how 
to map models and tasks onto underlying distributed systems,
that is, the parallelization strategy. Figure 1 shows the impact that
parallelization strategy can have on training performance of
important large ML models: Deep Learning Recommendation
Models (DLRM) and Large Language Models (LLM). An
optimal parallelization strategy achieves 34% higher training
throughput as compared to the Fully Sharded Data Parallel
(FSDP) strategy for both DLRM and LLM. Unlocking this
performance potential requires an agile performance modeling
framework that considers the interactions between model
architectures, machine learning tasks, and hardware systems.

Figure 1: For large ML model training, exploring the parallelization strategy
design space can lead to strategies with higher system throughput
(blue) than common strategies such as FSDP and TP.

Generally speaking, current approaches for training and 
deploying large ML models fall into three categories. The 
first option is to apply industry-grade parallelization strategies 
that give best guarantees for feasibility while providing ade- 
quate system performance on existing distributed systems (e.g., 
FSDP, ZeRO [73, 51]). This often comes at the expense of 
fully utilizing the underlying hardware. The second option is 
to carefully engineer custom, hierarchical parallelization strate- 
gies for the target model, task, and distributed system [56]. 
While this likely best utilizes underlying hardware, the engi- 
neering endeavor required for such customized parallelization 
strategy is nontrivial and the strategy itself may not be transferable
between tasks. The third option is to estimate system
performance via a software tool before either training or deployment.
Existing tools for predicting distributed ML performance
either require mature software implementations, are
training-task specific, are tailored to specific hardware architectures,
or suffer from a combination of these limitations.
To address the need for an agile exploration tool for paral- 
lelization strategies tailored to different use-cases, we propose 
a distributed ML performance model and evaluate it on a suite 
of real-world, large ML models, including deep learning recommender
systems and LLMs [5, 6, 7, 9, 38, 48, 58, 59, 74].
In this work, we first characterize the suite of real-world, 
large ML models at both model- and datacenter-deployment 
scales (Section 3). At the model architecture level, we identify 
performance-critical hardware requirements based on the mod- 
els’ compute and memory characteristics. At the datacenter 
scale, we quantify the communication required by conducting 
a fleet-wide characterization of at-scale training, showing that 
14-32% of all GPU hours are spent on communication with
no concurrent computation (i.e., exposed communication). 
To enable agile exploration of the parallelization design 
space, in this paper, MAD Max Beyond Single-Node, we 
propose a performance model for estimating system perfor- 
mance of a distributed ML workload for distributed systems 
of unique characteristics. The performance model takes into 
account target ML model architecture, task, parallelization 
scheme, and distributed system hardware to generate per-device
traces. These per-device traces can then be pieced
together to estimate the overall system performance of the tar- 
get ML model and task. Additionally, the performance model 
generates detailed breakdowns of both communication col- 
lectives and computation-communication overlap efficiency, 
enabling users to identify future optimization opportunities. 

Our performance model is validated against multiple real large-scale
distributed training experiments, demonstrating 97% and
91% performance prediction accuracies on serialized and overlapped
execution, respectively, for real-world use-cases.
Using this performance model, we identify parallelization
strategies with up to 2.24x and 5.27x throughput improvement
for pre-training and inference, respectively, across our
suite of large ML models. When considering parallelization
strategies not limited by the memory capacity of current training
systems, we identify strategies that achieve 2.43x and
12.13x throughput improvement for pre-training and inference,
respectively. Using the performance model, we point out
how model-level compute and communication requirements 
alter optimal parallelization strategy and increasing LLM con- 
text length calls for solutions beyond solely parallelization ex- 
ploration (Section 6). We also conduct a retrospective study on 
how different generations of GPUs impact overall training effi- 
ciency and follow up with a future technologies scaling study 
by showing the effects of improving systems components like 
compute efficiency, memory capacity and bandwidth, and hi- 
erarchical interconnect bandwidth (Section 6). 
The main contributions of this work are as follows: 

• We propose a performance model that enables agile exploration
of the distributed ML training and deployment design
space. Our performance model targets both implemented
and future models alike, allowing for accurate throughput
performance estimation with different model architectures,
tasks, hardware devices, and distributed systems.

• We show model-level insights on how parallelization strategies
interact with DLRM and its model architecture variants.
We show how asymmetric compute and communication
requirements from transformer and mixture-of-experts
variants lead to different optimal parallelization strategies.
Additionally, we note the limits of solely optimizing parallelization
strategies on LLMs of increasing context length.

• We show that to improve throughput performance for both
training and inference of large ML models, hardware specifications
across compute, memory, and interconnect have to
be concurrently improved.

We will open-source the proposed performance model to
enable follow-on work for modeling the interaction between 
parallelization strategies, models, tasks, and distributed sys- 
tems on ML system performance. 


2. Background 


In this section, we introduce a suite of model architectures 
across both recommender systems and LLMs. We then outline 
three tasks for these models: pre-training, fine-tuning, and 
inference (Section 2.1). Lastly, we discuss the parallelization 
strategies currently used to map the workloads (i.e., model and 
task) onto the distributed systems (Section 2.2). 


2.1. Models and Tasks 


Deep learning based recommender systems and LLMs fol- 
low the general model architecture of representing categorical 
inputs as embedding vectors and then processing these em- 
bedding vectors with model-specific computation layers. This 
means that there are many shared components: embedding 
tables, Multilayer Perceptrons (MLPs), and more intricate pro- 
cessing layers like transformer blocks that are emphasized to 
different degrees by each model. We focus on the following 
five classes of models throughout the paper: 

1. DLRM. The canonical at-scale recommendation model 
takes two types of inputs: dense and sparse features. Dense 
features, such as user age and current time, are processed
by MLP layers while the sparse categorical features are 
processed as lookups into large embedding tables. The pro- 
cessed results are fed into a feature interaction layer, where 
these intermediate results are either concatenated or multi- 
plied with one another via dot products [61, 62]. The result 
of this feature interaction layer is then fed into MLP layers 
to generate predictions like Click-Through Rate (CTR) [38]. 
For many large-scale DLRM models, storing and communi- 
cating trillion-parameter scale embedding tables is the pri- 
mary system bottleneck [14, 16, 20, 21, 30, 31, 37, 55, 64]. 

2. DLRM-Transformer. As sparse features for recommen- 
dation models have become complex, model architectures 
have also evolved to better model implicit relationships be- 
tween these sparse features. Some DLRM variants replace 
concatenation and dot-product based feature interactions 
with transformer encoder layers that model higher-order 
interactions and sequential relationships between sparse features.
Others [7, 48, 70] use transformer-style feature interaction
layers to tackle challenges like behavior sequence
modeling and personalized re-ranking. From a systems perspective,
transformer layers increase both required compute
and computation-communication overlap opportunities.

[Figure 2 panels: Embedding Table (sharded); Tensor Parallel (TP): shard parameters, use local parameters to compute partial sums, AllReduce activation and gradient partial sums; Distributed Data Parallel (DDP): AllReduce weight gradients during the backward pass.]

Figure 2: For recommendation models, applying FSDP, TP, or DDP on an MLP layer requires sharding or replicating parameters and
communicating parameters (orange) or partial sums (yellow). In this example, the embedding table’s large capacity requires it to be sharded.

3. DLRM-MoE. In the context of DLRMs, applying Mixture-of-Experts
(MoE) creates parallel Top MLPs that are conditionally
activated based on feature interaction [74]. Because
only a fraction of experts are active for each sample,
DLRM-MoE increases model capacity and (expert-to-expert)
communication while scaling computation only by
the number of active experts.

4. LLM. Large language models (LLMs) also use the “look
up embeddings then process them” architecture [6, 19, 50,
58, 72]. However, instead of using user and content categorical
features, LLMs convert tokens (character sequences)
to input embeddings. Subsequent processing layers use
alternating self-attention and feed-forward layers [60]. Un- 
like DLRMs, advancements in LLM modeling have been 
more focused on the processing layers than embeddings, 
reinforcing the importance of compute in LLM execution. 

5. LLM-MoE. In the context of LLMs, one way to apply
MoE is to replace the feed-forward layer in transformer
blocks with experts. By applying this technique, FLOPs
per token grow at a slower rate than overall model capacity,
leading to more efficient training and inference.
While FLOPs become less of a concern, non-blocking
inter-expert communication, especially during training,
becomes a larger systems concern.


For each of these model architectures, we are interested in pre- 
training, fine-tuning and inference. Pre-training stresses all of 
compute, memory capacity, and communication as it involves 
both forward and backward passes — along with retaining inter- 
mediate activations from the forward pass. The requirements 
of fine-tuning are a subset of pre-training, as the frozen param- 
eters of a model do not require updates, and memory capacity 
and communication requirements are slightly loosened. In- 
ference only requires the forward pass so compute is usually 
proportionally larger. 


2.2. Parallelization Strategies 


A model layer can be either replicated or sharded across 
devices. We explore the following parallelization strategies 
(Figure 2 shows forward pass execution): 

1. Fully Sharded Data Parallelism (FSDP). Parameters are 
sharded across devices. Before layer computation, miss- 
ing parameter shards are gathered from other devices via 
AllGather. During the backward pass, weight gradients are
reduced and sharded via ReduceScatter. 

2. Tensor Parallelism (TP). Parameters are sharded across 
devices. During the forward pass, each device uses its local
parameter shard to compute a partial sum. Devices then
communicate via AllReduce to obtain the aggregate sum. The same
principle is applied during the backward pass for gradients.

3. Distributed Data Parallelism (DDP). Parameters are replicated
across devices. During the forward pass, each device
computes independently. During the backward pass,
devices AllReduce to aggregate weight gradients.

We apply one parallelization strategy for each layer type. Figure 2
depicts different parallelization strategies for an MLP
layer and vanilla model parallel (MP) sharding for the embedding
tables. Additionally, parallelization strategies can be
applied hierarchically for multi-node systems, creating N-D
parallelism strategies.
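
As a compact summary of the three strategies listed above, the following sketch records how each one treats parameters and which collective it issues in each pass. It encodes only what this section states; the dictionary layout and the helper name `collectives_for` are illustrative assumptions and not part of the performance model's interface.

```python
from typing import Dict, List

# Per-strategy behavior as described in Section 2.2.
# The structure and key names are hypothetical, chosen for illustration.
STRATEGIES: Dict[str, Dict[str, object]] = {
    "FSDP": {
        "parameters": "sharded",
        "forward": ["AllGather"],       # gather missing parameter shards
        "backward": ["ReduceScatter"],  # reduce and shard weight gradients
    },
    "TP": {
        "parameters": "sharded",
        "forward": ["AllReduce"],       # aggregate activation partial sums
        "backward": ["AllReduce"],      # aggregate gradient partial sums
    },
    "DDP": {
        "parameters": "replicated",
        "forward": [],                  # devices compute independently
        "backward": ["AllReduce"],      # aggregate weight gradients
    },
}


def collectives_for(strategy: str, phase: str) -> List[str]:
    """Return the collectives a strategy issues in 'forward' or 'backward'."""
    if phase not in ("forward", "backward"):
        raise ValueError("phase must be 'forward' or 'backward'")
    return list(STRATEGIES[strategy][phase])
```

A lookup such as `collectives_for("DDP", "backward")` makes explicit that DDP communicates only in the backward pass, which is why its communication is a natural candidate for overlap with computation.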


3. Characterization 


In this section, we first characterize a suite of real-world large
ML models with respect to their model capacity, parameter
breakdowns, FLOPs, and memory bandwidth characteristics 
(Section 3.1). To get a better understanding of the models’ 
communication requirements, we conduct a fleet-wide charac- 
terization of at-scale training experiments (Section 3.2). 


3.1. Individual Model Characterization 


We first quantify the difference in compute, memory capacity, 
and bandwidth requirements between six real-world recom- 
mendation models and LLMs: DLRM-{A, B, C}, GPT-3 
175B, LLaMA-65B, LLaMA 2-70B. Figure 3 quantifies this 
diversity of requirements with two key observations:

Figure 3: For large ML models, key system resource requirements, (a) capacity, (b) compute, and (c) bandwidth, vary by orders of magnitude.

Figure 4: (a) Compute and exposed communication make up the majority of observed at-scale training cycles. (b) The degree of communication
overlapped with computation and data loading is workload dependent. (c) Breakdown of communication collectives also varies by workload.


O1: Parameter count — and allocation across model lay- 
ers — varies by orders of magnitude between models, im- 
pacting system capacity requirements. Recommendation 
models contain significantly more parameters than LLMs (Fig- 
ure 3 (a)). Despite variation in parameter count across LLMs, 
GPT-3 consists of roughly 2-68x fewer parameters as compared
to recommendation models. Training and deploying
these recommendation models and LLMs require multi-node 
distributed systems, yet the size of the model governs how 
many devices (i.e., GPUs) are required to fit the entire model 
and the viable set of scale-out parallelization strategies. 


Additionally, virtually 100% of parameters in recommenda- 
tion models are used for embeddings while almost 100% of 
parameters in LLMs are dedicated to compute. This reflects 
the transformer-heavy computation of current LLMs, in con- 
trast to embedding-driven recommendation models that offer 
at-scale personalization. 


O2: Recommendation models require fewer FLOPs per
sample as compared to LLMs, yet require >20x higher 
memory bandwidth for sparse lookups. Figures 3 (b, c) 
demonstrate how recommendation models and LLMs show 
opposite trends for compute requirements as compared to 
sparse lookup bandwidth. Sparse lookup bandwidth require- 
ments for recommendation models far surpass LLMs — a fact 
that is consistent with a higher proportion of parameters being 
dedicated to embeddings. However, the opposite is true for 
compute requirements, as LLMs require significantly higher 
FLOPs per sample. As discussed in Section 4, these varying 
system requirements play an important role in the design of 
an optimal parallelization strategy for each model. 


3.2. Fleet-wide Communication Characterization 


In addition to model-level characterization, we look at fleet- 
wide model training. We observe, over an extended period of 
time, the importance of communication for training the latest 
DLRM-style models and LLMs. Figure 4 quantifies the role 
of communication with two key observations: 

O3: Compute and exposed communication make up the
majority of observable training GPU cycles. Compute, de- 
fined as cycles with either device computation or memory 
lookups (orange) and exposed communication, defined as cy- 
cles with only inter-device communication (blue), make up 
>82% of all observable training GPU cycles for both DLRM 
and LLMs (Figure 4 (a)). The rest of the cycles are attributed 
to host-device communication — exposed memcpy (yellow) — 
and inactivity due to data ingestion, kernel launch overhead, 
etc. — GPU idle (grey). From this observation, we focus 
our performance modeling efforts on predicting the expected 
behavior of compute and communication cycles. 

O4: Model architecture and parallelization strategy 
differences impact both the amount of computation- 
communication overlap and the types of communication 
collectives used. When model training spans multiple devices, 
replicating or sharding model components leads to communi- 
cation calls for parameters, activations and/or gradients. Being 
able to overlap these communication calls with computation so 
that the training devices are doing “useful work" is important 
for utilization. Figure 4 (b) shows that ~50% of communica- 
tion calls for DLRM training are overlapped with computation, 
whereas >65% of communication calls for compute-dominated 
LLMs are overlapped.

Figure 5: Our performance model works in five stages. After workload specifications and layer execution orders are established, traces
for individual layer execution are generated and then combined with required communication calls to form complete computation and
communication streams.


Figure 4 (c) shows the spread of different communication
collectives during training. For DLRM models, All2All is
heavily emphasized while LLMs spend the majority of their
communication cycles on AllReduce. This is a direct result
of the difference in model architecture, and thus in the active
parallelization strategy. Since DLRMs require large amounts of sparse
lookups from sharded embedding tables, the per-device unique
embedding lookups have to be distributed to each device via
All2All. In contrast, LLMs have fewer parameters and
are more amenable to replication of compute parameters,
allowing for DDP opportunities that require AllReduce for
aggregating weight gradients.

In this section, we characterize real-world large ML models 
from model architecture and distributed training perspectives. 
From Section 3.2 we see that model architectures and the way
in which we map them onto distributed systems significantly impact
system resource utilization, and thus overall performance. To 
better understand how to best map current and future large 
ML models onto different distributed systems, we propose an 
agile, at-scale accurate performance model. 


4. Proposed Design 


In this section, we go over the design of our performance 
model and walk through how the model simulates distributed 
ML workloads. First, we provide an overview of the design 
and key assumptions made behind its design process (Section 
4.1). Next, we explain how the performance model processes 
ML model layers by the layers’ primary characteristics (Sec- 
tion 4.2). Finally, we cover how individually processed layers 
are pieced together into complete computation and communi- 
cation streams via the required communication calls of a given 
parallelization strategy (Section 4.3). 


4.1. Design Overview


Figure 5 highlights the five main processes of our performance 
model via a DLRM-Transformer example. The performance 
model treats individual layers of an ML model as core blocks 
for generating per-device execution traces. To simulate the 
per-iteration behavior of a distributed ML workload, these execution
traces are then pieced together with the required communication
calls of the target parallelization strategy. From
per-iteration behavior, the performance model generates estimations
of overall throughput and other system-level serialized
and overlapped execution breakdowns.

Users have to provide three JSON files for: 1) model ar- 
chitecture via layer-specific configurations (e.g., number of 
MLP layers, embedding table dimension, number of trans- 
former layers and heads), 2) distributed system specifica- 
tions (e.g., Tensor Float (TF32) utilization, HBM peak band- 
width, AllReduce intra-node interconnect utilization), and 
3) task and parallelization strategy (e.g., pre-training/fine- 
tuning/inference, intra-/inter-node parallelization strategy, 
intra-/inter-node parallelization degrees). See Section 5 for 
a more exhaustive list of currently supported configuration 
options. 
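
To make the three inputs concrete, the sketch below builds one dictionary per JSON file and serializes it. Every key name and value is a hypothetical placeholder chosen to mirror the examples in the text; the actual schema of the performance model is not given in this section and may differ.

```python
import json

# All keys and values are hypothetical placeholders, not the tool's schema.
model_config = {
    "num_mlp_layers": 4,
    "embedding_dim": 128,
    "num_transformer_layers": 4,
    "num_attention_heads": 8,
}
system_config = {
    "tf32_utilization": 0.70,                  # fraction of peak, in [0, 1]
    "hbm_peak_bandwidth_gb_per_s": 2000,       # placeholder value
    "allreduce_intra_node_utilization": 0.80,  # fraction of peak, in [0, 1]
}
task_config = {
    "task": "pre-training",  # or "fine-tuning" / "inference"
    "intra_node_strategy": "TP",
    "inter_node_strategy": "DDP",
    "intra_node_degree": 8,
    "inter_node_degree": 16,
}
configs = {"model": model_config, "system": system_config, "task": task_config}


def validate(cfgs: dict) -> None:
    """Basic sanity checks on the illustrative configuration."""
    if cfgs["task"]["task"] not in ("pre-training", "fine-tuning", "inference"):
        raise ValueError("unknown task")
    for key, value in cfgs["system"].items():
        if key.endswith("utilization") and not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} must lie in [0, 1]")


validate(configs)
# One JSON document per input file.
serialized = {name: json.dumps(cfg, indent=2) for name, cfg in configs.items()}
```

Keeping the model, system, and task descriptions in separate files means one of them can be swapped (for example, a different parallelization strategy) without touching the other two.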

With these configurations, individual layers are first processed
by their primary system requirements. For example,
embedding bag execution time is estimated from the number of
embeddings to look up and the per-GPU high-bandwidth memory
(HBM) bandwidth, while the execution time of
a transformer encoder layer is estimated from TF32 compute throughput.
Based on the replication and sharding specified by the target 
parallelization strategy, the required communication calls are 
processed by collective-specific intra- (e.g., NVLink) and inter- 
(e.g., Infiniband, RDMA over Converged Ethernet (RoCE)) 
node communication bandwidths. 

We take into account task-level requirements (i.e., pre- 
training/fine-tuning/inference) to construct per-device com- 
putation and communication streams with data dependencies 
and potential computation-communication overlap. 

Assumptions: 

• Since we are primarily focused on large models, underlying
distributed systems are multi-device in nature. For multi-device
execution, a first-order analysis of execution behavior
and overall performance can be estimated via modeling
per-node layer execution and inter-node parallelization communication.
Kernel-level improvements (e.g., [43]), while
not the focus of this work, can be effectively modeled as
increased compute and memory lookup utilization.

• The performance model assumes that the entire model can
be fit onto the training/inference devices (i.e., when sharded,
the model can fit onto GPUs). Recent high-performance
training platforms target this design point [37]. Design
points where model parameters have to be shuffled back and
forth between CPU and device are currently unsupported.

• Device-host communication (e.g., CPU-GPU data loading)
is a relatively second-order consideration and is mostly overlapped
and hidden between training/inference iterations.
This observation is shared in [37] and our fleet-wide characterization
in Section 3.2, Figure 4.


4.2. Processing Individual Model Layers 


Layers are processed by their main system requirement. For 
example, we illustrate how MLP and embedding bag perfor- 
mance are estimated differently for our Figure 5 example. 
Compute Blocks. Assuming that compute time is the main 
bottleneck for MLPs, we estimate compute time per layer as: 


$$\text{Compute time per layer} \approx \frac{\text{FLOPs per layer}}{(\text{GPU peak FLOPS}) \times (\text{Compute utilization})}$$


where FLOPs per layer is determined by the MLP layer’s 
dimensions and target batch size. GPU peak FLOPS are heav- 
ily dependent on data type (e.g., 32-bit, 16-bit FP/TF/BF) 
and whether or not tensor cores are enabled. We incorporate 
compute utilization — or in the case of GPUs, SM utiliza- 
tion/occupancy — as a factor in [0,1]. Typical compute uti- 
lization factors for A100s on layers in our models of interest 
are ~70%. We adopt a similar approach for modeling self- 
attention and fully-connected (FC) layers found in transformer 
layers, where FLOPs per layer is estimated by additional fac- 
tors such as attention dimension and context length. 
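
The compute-block estimate can be sketched in a few lines. The FLOP count uses the standard 2 x batch x input x output figure for the forward pass of one dense layer; the function names, the layer shape, and the 156 TFLOPS peak (assumed here to be the A100 TF32 tensor-core rating) are illustrative assumptions rather than values taken from the paper.

```python
def mlp_layer_flops(batch_size: int, in_dim: int, out_dim: int) -> int:
    """Forward-pass FLOPs of one dense layer (one multiply and one add per weight)."""
    return 2 * batch_size * in_dim * out_dim


def compute_time_s(flops: float, peak_flops: float, utilization: float) -> float:
    """Compute time ~ FLOPs / (peak FLOPS * utilization), utilization in (0, 1]."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must lie in (0, 1]")
    return flops / (peak_flops * utilization)


# Hypothetical example: a 1024x1024 layer at batch size 4096 with ~70% utilization.
flops = mlp_layer_flops(batch_size=4096, in_dim=1024, out_dim=1024)
layer_time = compute_time_s(flops, peak_flops=156e12, utilization=0.70)
```

Treating utilization as a single multiplicative factor is what lets kernel-level improvements be modeled simply as a higher utilization value.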
Embedding Bags. Assuming that lookup time is the main 
bottleneck for embedding bags, we estimate lookup time as: 


$$\text{Lookup time} \approx \frac{\text{Lookup bytes per GPU}}{(\text{HBM BW}) \times (\text{HBM utilization})}$$


where Lookup bytes is determined by the number of embed- 
ding tables, number of lookups per embedding table, embed- 
ding dimension, and embedding precision. Lookup bytes per 
GPU is highly parallelization strategy dependent. In this case, 
we assume that the embedding table is evenly sharded across 
GPUs in terms of both capacity and number of lookups. If the
number of lookups is unevenly distributed between GPUs,
we can adjust the lookup bytes per GPU on a per-GPU basis [55].
HBM utilization is a factor in [0,1] and typical
values for embedding bags of interest are ~80% for A100s.
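
A minimal sketch of the embedding-bag estimate follows, assuming even sharding across GPUs as in the text. Scaling the lookup bytes by the batch size is an assumption of this sketch, and the function and parameter names are hypothetical.

```python
def lookup_bytes_per_gpu(num_tables: int, lookups_per_table: int, emb_dim: int,
                         bytes_per_elem: int, batch_size: int, num_gpus: int) -> float:
    """Bytes read from HBM per GPU, assuming lookups are evenly sharded."""
    total = num_tables * lookups_per_table * emb_dim * bytes_per_elem * batch_size
    return total / num_gpus


def lookup_time_s(bytes_per_gpu: float, hbm_bw_bytes_per_s: float,
                  hbm_utilization: float) -> float:
    """Lookup time ~ bytes per GPU / (HBM bandwidth * HBM utilization)."""
    if not 0.0 < hbm_utilization <= 1.0:
        raise ValueError("hbm_utilization must lie in (0, 1]")
    return bytes_per_gpu / (hbm_bw_bytes_per_s * hbm_utilization)


# Hypothetical workload; every number below is a placeholder.
per_gpu = lookup_bytes_per_gpu(num_tables=100, lookups_per_table=20, emb_dim=128,
                               bytes_per_elem=4, batch_size=4096, num_gpus=8)
emb_time = lookup_time_s(per_gpu, hbm_bw_bytes_per_s=2.0e12, hbm_utilization=0.80)
```

For unevenly distributed lookups, `lookup_bytes_per_gpu` would be replaced by a per-GPU list, with the slowest GPU setting the layer time.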


4.3. Piecing Together Computation and Comm. Streams 


Specifying Explicit Execution Order. To generate per- 
device traces for different ML tasks, an explicit execution 
priority must be established for the different layers. In Figure 5,
we can establish the order as follows: (1) Embedding, (2)
Bottom MLP, (3) Transformer, (4) Top MLP. During the backward
pass, the execution order is reversed. If the target
task is fine-tuning, we also specify frozen layers, reducing unnecessary
computation and communication of certain weight
gradients.

Figure 6: Sample generated GPU compute and communication
streams with labeled exposed communication.

Generating Parallelization-Specific Streams. An explicit 
execution order by itself is not enough to construct accurate 
traces. A target parallelization strategy is required to specify
the required communication collectives. Explicit data dependencies,
along with the parallelization strategy, determine the
blocking/non-blocking nature of the communication calls. In
Figure 5, MLP and transformer layers are distributed via DDP 
while embedding tables are distributed via sharding. 

Figure 6 illustrates generated forward pass streams from our 
DLRM-Transformer example. We see that the traces are slot- 
ted into a compute stream and communication stream. Each 
trace will have dependencies that come explicitly from execu- 
tion annotations and implicitly from underlying parallelization 
strategies. For example, EMB has an explicit output dependency 
of Bot_MLP_0 and implicit output dependency of EMB_c_A2A 
from sharding the embedding table. EMB_c_A2A is blocking
since Transformer_Attn_0 needs EMB_c_A2A’s results. 

Estimating Communication Collective Execution. For 
All2All, we estimate its execution as: 


$$\text{All2All time} \approx \frac{\text{“SendCount” Bytes per GPU}}{\text{Effective All2All BW}}$$


where “SendCount” Bytes per GPU is the number of bytes
sent by each GPU to every other GPU. “SendCount” Bytes
per GPU is dependent on not only “Lookup bytes per GPU”
but also the sharding degree and number of devices. Since
the All2All NCCL implementation is composed of individual
point-to-point Send() and Recv() calls, it is bound by the
slowest level of interconnect. Thus, for multi-node systems, Effective All2All BW is
set as that of either Infiniband or RoCE. For other cases, like
a single-node 8-GPU system, Effective All2All BW may be NVLink BW.
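
The rule that All2All is bound by the slowest interconnect level it crosses can be sketched as follows. The function names and the bandwidth figures are hypothetical placeholders, not measured values.

```python
def effective_all2all_bw(num_nodes: int, intra_node_bw: float,
                         inter_node_bw: float) -> float:
    """Bandwidth of the slowest interconnect level the All2All has to cross."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if num_nodes == 1:
        return intra_node_bw  # e.g., NVLink only
    return min(intra_node_bw, inter_node_bw)  # e.g., bounded by Infiniband or RoCE


def all2all_time_s(send_count_bytes_per_gpu: float, effective_bw: float) -> float:
    """All2All time ~ "SendCount" bytes per GPU / effective All2All bandwidth."""
    return send_count_bytes_per_gpu / effective_bw


# Placeholder bandwidths in bytes per second.
bw_single_node = effective_all2all_bw(1, intra_node_bw=300e9, inter_node_bw=25e9)
bw_multi_node = effective_all2all_bw(16, intra_node_bw=300e9, inter_node_bw=25e9)
a2a_time = all2all_time_s(512e6, bw_multi_node)
```

The same payload takes longer once the collective spans more than one node, which is the scaling effect the All2All estimate is meant to capture.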

Likewise, we can generate a similar set of traces for the 
backward pass. Since the MLP and transformer layers are 
parallelized via DDP, we have non-blocking AllReduce com- 
munication calls during the backward pass. The AllReduce 
calls are for aggregating per-layer weight gradients and are 
thus non-blocking (i.e., they are not on the critical path for 
backpropagation). We estimate the execution time of the non-blocking
AllReduce calls for weight gradients as:


$$\text{AllReduce time} \approx \frac{\text{“SendBuffer” Bytes per GPU}}{\text{Effective AllReduce BW}}$$


Model                  | Evaluation Metric              | Measured Result | Performance Model Result | Modeling Accuracy (%)
DLRM-A                 | Serialized Iteration Time (ms) | 67.40 ms        | 65.30 ms                 | 96.89%
DLRM-A                 | % Communication Exposed (%)    | 82.37%          | 75.46%                   | 91.62%
DLRM-A                 | Throughput (MQPS)              | 1.2 MQPS [37]   | 1.21 MQPS                | 99.17%
DLRM-B                 | Throughput (MQPS)              | 3.4 MQPS [37]   | 3.06 MQPS                | 90%
LLaMA-70B (2048 A100s) | GPU Hours for 306k steps       | 1,022,361 Hrs   | 863,397 Hrs              | 84.66%
LLaMA-70B (2048 A100s) | Days to Train 1.4T Tokens      | 20.83 Days [58] | 19.21 Days               | 92.27%


Table 1: Validation of various first-order execution metrics. 


where “SendBuffer” Bytes is the total number of bytes sent 
by each GPU and is directly proportional to the number of 
parameters in each layer. Effective AllReduce BW is a ratio 
of intra-node communication (e.g., NVLink) bandwidth and 
inter-node communication (e.g., Infiniband or RoCE) since 
data is communicated on both classes of channels for the 
NCCL implementation. The exact ratio between the two com- 
munication technologies is dependent on factors like the num- 
ber of nodes and NCCL implementation version (e.g., ring vs. 
tree). We use real hardware measurement data to understand 
what these effective interconnect ratios and bandwidths are in 
practice. Large-scale training also often exhibits non-constant 
bandwidth across intra- and inter-node hierarchies. We also 
consider AllGather and ReduceScatter communication calls, 
which are required in FSDP and TP. 
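
A short sketch of the AllReduce estimate is given below. Because the text derives the effective bandwidth from hardware measurements, the sketch takes it as an input instead of modeling the intra-/inter-node mix. Equating the send buffer with parameter count times bytes per gradient element is an assumption consistent with the stated proportionality, and all names are hypothetical.

```python
def send_buffer_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    """Gradient bytes each GPU contributes; proportional to the layer's parameters."""
    return num_params * bytes_per_param


def allreduce_time_s(num_params: int, effective_allreduce_bw: float,
                     bytes_per_param: int = 4) -> float:
    """AllReduce time ~ "SendBuffer" bytes per GPU / effective AllReduce bandwidth."""
    if effective_allreduce_bw <= 0:
        raise ValueError("effective_allreduce_bw must be positive")
    return send_buffer_bytes(num_params, bytes_per_param) / effective_allreduce_bw


# Placeholder: a 1024x1024 weight matrix and a measured 100 GB/s effective bandwidth.
grad_time = allreduce_time_s(num_params=1024 * 1024, effective_allreduce_bw=100e9)
```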

Factoring In Computation-Communication Overlap. 
We maintain separate computation and communication 
streams and overlap traces with no data dependencies. In 
this performance model, we assume GPU kernels are launched 
whenever data dependencies are resolved. Ideally, we want 
as much overlap between computation and communication 
as possible. As we can see in Figure 6, there is a segment of 
exposed communication for the AlI2All where compute and 
memory units of the training device (i.e., GPU) are mostly idle 
and thus underutilized. 
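
The two-stream overlap logic can be illustrated with a small scheduler. This is a simplified sketch under stated assumptions, not the authors' implementation: each trace launches as soon as its stream is free and its dependencies have finished, and exposed communication is the time the communication stream is busy while the compute stream is idle. The `Trace` class, the function names, and all durations are hypothetical; the trace names follow the Figure 6 example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Trace:
    name: str
    stream: str      # "compute" or "comm"
    duration: float  # arbitrary time units
    deps: List[str] = field(default_factory=list)


def schedule(traces: List[Trace]) -> Dict[str, Tuple[float, float]]:
    """Assign (start, end) to each trace; traces must be listed in execution order."""
    stream_free = {"compute": 0.0, "comm": 0.0}
    spans: Dict[str, Tuple[float, float]] = {}
    for t in traces:
        ready = max((spans[d][1] for d in t.deps), default=0.0)
        start = max(ready, stream_free[t.stream])
        spans[t.name] = (start, start + t.duration)
        stream_free[t.stream] = start + t.duration
    return spans


def exposed_comm(traces: List[Trace], spans: Dict[str, Tuple[float, float]]) -> float:
    """Total time the comm stream is busy while the compute stream is idle."""
    compute = [spans[t.name] for t in traces if t.stream == "compute"]
    exposed = 0.0
    for t in traces:
        if t.stream != "comm":
            continue
        start, end = spans[t.name]
        # Compute traces are serialized on one stream, so their overlaps add up.
        hidden = sum(max(0.0, min(end, c_end) - max(start, c_start))
                     for c_start, c_end in compute)
        exposed += (end - start) - hidden
    return exposed


# Forward-pass fragment of the DLRM-Transformer example with placeholder durations.
traces = [
    Trace("EMB", "compute", 2.0),
    Trace("Bot_MLP_0", "compute", 1.0, deps=["EMB"]),
    Trace("EMB_c_A2A", "comm", 3.0, deps=["EMB"]),
    Trace("Transformer_Attn_0", "compute", 2.0, deps=["EMB_c_A2A", "Bot_MLP_0"]),
]
spans = schedule(traces)
exposed = exposed_comm(traces, spans)
```

With these placeholder durations, the All2All runs from t=2 to t=5, only one unit of it is hidden behind Bot_MLP_0, and the remaining two units are exposed because Transformer_Attn_0 cannot start until the All2All finishes.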

This performance model allows us to both identify combi- 
nations of kernels and parallelization strategies that lead to 
exposed communication and experiment with different paral- 
lelization strategies to decrease exposed communication seg- 
ments. Optimizing for computation-communication overlap 
is an important objective across multi-node, large-scale ML 
workloads. Currently, 14-32% of GPU cycles on the training
clusters come from exposed communication (Figure 4). 


5. Experimental Methodology 


This section describes our validation efforts and details the 
design space in this work, e.g., variations of real-world models, 
hierarchical parallelization strategies, and hardware platforms. 

Performance Model Validation. Table 1 lists validation 
points of various first-order execution metrics across real, mea- 
sured recommendation and LLM training experiments. For 
DLRM-A training [37], we validate the performance model 
for first-order execution metrics of serialized iteration time, 
% communication exposed, and training throughput to 96.89, 
91.62, and 99.17% modeling accuracy, respectively. Additionally, 
in Figure 7, we validate DLRM-A training for complete 
serialized and overlapped execution behavior across both 
8- and 128-A100 ZionEX platforms. We validate serialized 
behavior to ensure accuracy of model layers and parallelization 
communication, overlapped behavior to account for at-scale 
latency-hiding opportunities, and systems with different numbers 
of nodes to capture network scaling effects. 

Figure 7: DLRM-A serialized and overlapped execution validation 
for 8- and 128-GPU training. 


In addition to DLRM-A, we also validate DLRM-B training, 
reporting 3.05 MQPS from our model against a measured 3.4 
MQPS. For the largest LLaMA configuration (LLaMA-65B), 
our performance model estimates training time for all 1.4T 
tokens to take 19.21 days as opposed to the reported 21 days 
in [58]. For this use-case, we use the same hardware platform 
as reported in [58] (i.e., 2048 80GB HBM A100s). We also 
validate the aggregate GPU Hours to train for 306k steps for 
84.66% modeling accuracy. We elaborate on avenues for 
further modeling accuracy in Section 7. 
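
To make the validation numbers above concrete, the short Python sketch below computes a relative modeling accuracy for the DLRM-B throughput and LLaMA training-time data points. The accuracy definition (one minus relative error against the measured value) is an assumption made here for illustration; the paper's tables may use a different definition.

```python
def modeling_accuracy(predicted, measured):
    """Assumed definition: 100 * (1 - |predicted - measured| / measured)."""
    return 100.0 * (1.0 - abs(predicted - measured) / measured)


# DLRM-B throughput: 3.05 MQPS modeled vs. 3.4 MQPS measured.
dlrm_b_accuracy = modeling_accuracy(3.05, 3.4)    # roughly 89.7%

# LLaMA training time: 19.21 days modeled vs. 21 days reported.
llama_accuracy = modeling_accuracy(19.21, 21.0)   # roughly 91.5%
```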


Model Variations. Table 2 lists the suite of large ML mod- 
els explored in Section 6. We explore transformer and MoE 
variants of real-world DLRM-A and DLRM-B. The trans- 
former feature interaction variants have 4 layers and a down- 
sampled sequence length of 80. MoE variants are configured 
with 16 experts (2 active) per layer. For the LLM models, 
we follow specifications in [6, 58, 59]. For LLM-MoE, we 
explore a hypothetical 1.8T-parameter model with 16-way (2 active) 
MoE for the MLPs in transformer blocks. We use 
fixed global batch sizes as specified in prior studies [37, 58] 
to maintain target model accuracy. 


Design Space Exploration. We use FSDP [73] as the base- 
line due to its wide adoption and ability to best guarantee 
training feasibility by minimizing sharding memory footprint. 
We explore valid hierarchical parallelism strategies at intra- 
and inter-node levels, considering combinations of DDP, FSDP, 
and TP. For hardware, unless otherwise stated, we use train- 
ing systems from prior case studies [37, 58] (Table 3). We 
also explore implications of using H100 and H100 SuperPOD 
systems by replacing our A100-based models with H100 specifications 
[41, 42], i.e., A100+ and A100+ (Inter+). 
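
The hierarchical design space described above can be enumerated with a small Python sketch. It is only an illustration under stated assumptions: a strategy is treated as feasible when the sharded model state fits in device memory, activations and temporary buffers are ignored, and the helper names and byte counts are hypothetical rather than taken from the paper.

```python
from itertools import product

STRATEGIES = ("DDP", "FSDP", "TP")
# Assumption: DDP replicates model state; FSDP and TP shard it.
SHARDS_STATE = {"DDP": False, "FSDP": True, "TP": True}


def shard_factor(intra, inter, gpus_per_node, num_nodes):
    """Number of ways the model state is split for an (intra, inter) pair."""
    factor = 1
    if SHARDS_STATE[intra]:
        factor *= gpus_per_node
    if SHARDS_STATE[inter]:
        factor *= num_nodes
    return factor


def feasible_strategies(state_bytes, hbm_bytes, gpus_per_node=8, num_nodes=16):
    """Return ((intra, inter), per-GPU bytes) pairs that fit in HBM."""
    fits = []
    for intra, inter in product(STRATEGIES, repeat=2):
        per_gpu = state_bytes / shard_factor(intra, inter, gpus_per_node, num_nodes)
        if per_gpu <= hbm_bytes:
            fits.append(((intra, inter), per_gpu))
    return sorted(fits, key=lambda item: item[1])


# Hypothetical example: 160 GB of model state on 40 GB devices.
candidates = feasible_strategies(state_bytes=160e9, hbm_bytes=40e9)
```

With these example numbers, only full replication, (DDP, DDP), is filtered out as out-of-memory; a real feasibility check would also need to account for activations and optimizer-specific state.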


Model | # Parameters | FLOPs per sample/token | Sparse Lookup Bytes per sample/token | Global Batch Size | Context Length 
DLRM-A [37] | 793B | 638M | 22.61 MB | 64K | N/A 
DLRM-A Transformer | 793B | 2.6B | 22.61 MB | 64K | 80 
DLRM-A MoE | 795B | 957M | 22.61 MB | 64K | N/A 
DLRM-B [37] | 332B | 60M | 13.19 MB | 256K | N/A 
DLRM-B Transformer | 332B | 2.1B | 13.19 MB | 256K | 80 
DLRM-B MoE | 333B | 90M | 13.19 MB | 256K | N/A 
GPT-3 [6] | 175B | 350B | 49.2 KB | 2K (4M tokens) | 2048 
LLaMA [58] | 65.2B | 130.4B | 32.8 KB | 2K (4M tokens) | 2048 
LLaMA2 [59] | 70B | 140B | 32.8 KB | 2K (4M tokens) | 4096 
LLM-MoE | 1.8T | 550B | 42.8 KB | 2K (4M tokens) | 8192 

Table 2: Target recommendation models, LLMs, and their variants by key model-level characteristics. 

Figure 8: We can improve pre-training performance over FSDP baseline by applying intra- and inter-node parallelization strategies for base 
dense and transformer layers separately. Throughput-optimal parallelization strategies are listed in (intra-, inter-) order. Black and white, 
underlined text refer to recommendation base dense and transformer layers, respectively. 


Specification | DLRM Training System [37] | LLM Training System [58] 
Base device | NVIDIA A100 40GB | NVIDIA A100 80GB 
Devices per node | 8 | 8 
# nodes | 16 | 256 
Peak TF32 throughput | 20 PFLOPS | 319 PFLOPS 
HBM capacity | 5 TB | 164 TB 
HBM bandwidth | 199 TB/s | 3.96 PB/s 
Intra-node interconnect bandwidth (unidirectional) | 38.4 TB/s | 614.4 TB/s 
Inter-node interconnect fabric | RoCE | InfiniBand 
Inter-node interconnect bandwidth (unidirectional) | 25.6 Tbps | 409.6 Tbps 

Table 3: Baseline distributed systems used in evaluation. 


6. Evaluation Results and Analysis 


When parallelization strategies are tailored to specific deep 
learning models and tasks at hand, we can achieve 8~124% 
throughput improvement. Figure 8 overviews pre-training 
throughput of key large ML models (Table 2) normalized 
to the baseline. We achieve, on average, a 65.9% pre-training 
throughput improvement (blue bars) over FSDP by tuning 
parallelization strategies at the layer-type granularity. The 
strategy that achieves optimal training throughput is indicated 
in parentheses. For example, when considering the base dense 
layers of DLRM-A, applying Tensor Parallelism within a node 
of 8 GPUs and Distributed Data Parallelism across nodes of 
GPUs (i.e., (TP, DDP)) leads to optimal pre-training through- 
put. In cases like DLRM-A Transformer, where both base 
dense and transformer layers are present, the optimal way to 
parallelize each type of layer may differ. 

Figure 9: DLRM-A Pre-training. Considering memory capacity 
constraints, applying TP and DDP for intra- and inter-node parallelism, 
respectively, on base dense layers achieves the highest throughput. 
Gray bar indicates invalid parallelism strategy due to OOM. 

We also indicate, via the orange dotted bars, 
the expected throughput improvement from optimizing parallelization 
strategies if model parallelization is not constrained 
by the current distributed systems’ memory capacity. The 
throughput-optimal parallelization strategy and its expected 
improvement are determined by a multitude of factors, such as 
underlying model architecture, underlying distributed system, 
and specific task. We highlight 7 key observations and discuss 
the underlying insights: 

Insight 1: [DLRM] Trillion-parameter embedding ta- 
bles in DLRMs limit parallelization strategies for the tables 
to sharding, shifting overall parallelization strategy explo- 
ration to focus on the dense components (Figure 9). 

Since embedding tables of DLRM-A make up 99.96% of 
its 793B parameters, the only parallelization strategy viable 
for DLRM embedding tables on current GPU systems is naive 
model parallelism sharding. This leaves parallelization strat- 
egy exploration on the base dense layers. Figure 9 demon- 
strates that, over valid parallelization strategies of the base 
dense layers on the x-axis, training throughput performance of 
DLRM-A can vary significantly, from 0.19x ((TP), (MP)) to 
1.14x ((TP, DDP), (MP)) over the FSDP baseline. Applying 
TP scales communication requirements with the size of partial 
sums and activations. If we apply TP at the intra-node level, 
as opposed to globally, we can take full advantage of high-BW 
NVLink to communicate the partial sums and activations. 
In this case, ((DDP), (MP)) is prohibitive due to OOM since 
it necessitates replicating the dense layers’ model parameters, 
gradients, and optimizer states across all devices. 

Figure 10: Between DLRM variants, both optimal parallelization 
strategy and expected throughput improvement vary. 

Insight 2: [LLMs] The billion-parameter scale of trans- 
former layers in LLMs makes intra-node replication for 
compute layers infeasible. In contrast, the small mem- 
ory footprint of word embeddings (< 2GB) allows it to be 
replicated across all devices via DDP. 

In contrast to DLRMs, for LLMs such as GPT-3, the FSDP 
baseline offers competitive training throughput performance 
(Figure 8). Since the word embeddings of LLMs are relatively 
small (0.37% of GPT-3), full per-device embedding replication 
is a viable option via DDP. As in the DLRM cases, we focus 
our parallelization strategy exploration on the compute-bound 
layers. However, in the case of GPT-3, any form of layer 
replication across nodes (e.g., (TP, DDP)) leads to OOM 
since intra-node sharding is insufficient for meeting memory 
capacity requirements. Additional device memory capacity 
can unlock up to 1.68x training throughput improvement. 
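
A quick back-of-the-envelope check of the word-embedding footprint quoted above can be written as follows. The 175B parameter count and the 0.37% share come from the text; the 2 bytes per parameter (16-bit storage) is an assumption made here, not a figure stated in the paper.

```python
GPT3_PARAMS = 175e9          # total parameters (Table 2)
EMBEDDING_FRACTION = 0.0037  # word embeddings are 0.37% of GPT-3
BYTES_PER_PARAM = 2          # assumption: 16-bit parameters

embedding_params = GPT3_PARAMS * EMBEDDING_FRACTION
embedding_gb = embedding_params * BYTES_PER_PARAM / 1e9  # about 1.3 GB
```

Under the 16-bit assumption the embeddings stay below the 2 GB bound mentioned in Insight 2, which is small relative to an 80 GB device and is why replicating them on every GPU is cheap.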

Insight 3: [Parallelization Strategy Order] Ordering of 
hierarchical parallelization strategies matters. Replication 
and sharding strategies must be placed in the correct order 
to ensure optimal performance (Figures 8, 9). 

The “order” in which we apply hierarchical parallelization 
strategies matters greatly in terms of both memory capacity 
footprint and expected throughput. For example, applying 
((TP), (DDP)) shards the model component by number of 
devices in a node while applying ( (DDP), (TP) ) shards the 
component by number of nodes. In Figure 9, where there are 
8 GPUs within a node and 16 nodes, the latter strategy leads 
to a lower per-GPU memory footprint. Additionally, training 
throughput also varies from using different interconnect chan- 
nels for communication. For example, ( (TP), (DDP) ) leads 
to AllReduce of activations over faster NVLink and weight 
gradients over slower RoCE/IB. On the other hand, ( (DDP) , 
(TP) ) leads to communicating activations over RoCE/IB and 
weight gradients over NVLink. For LLMs, long context 
lengths increase the size of activations to be communicated, so 
applying inter-node TP leads to significant slowdown (0.18x 
for GPT-3). On the other hand, utilizing NVLink to communicate 
large activations leads to 1.34x speedup for GPT-3. 

Figure 11: Pareto curves of parallelization strategies for DLRM 
variants for (a) pre-training and (b) inference. Each point is a different 
parallelization strategy. 

Figure 12: Given increasing context lengths, solely altering parallelization 
strategies has diminishing returns for performance benefits 
over FSDP. 
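
The channel assignment behind Insight 3 can be sketched as follows: TP places the activation AllReduce on whichever level it is applied to, and DDP places the weight-gradient AllReduce on its level. The function names, message sizes, and bandwidths below are hypothetical, and the estimate ignores collective algorithm factors and overlap, so it is only meant to show the direction of the effect.

```python
def allreduce_placement(intra, inter):
    """Map each AllReduce payload to the interconnect level that carries it."""
    placement = {}
    for level, strategy in (("intra-node", intra), ("inter-node", inter)):
        if strategy == "TP":
            placement["activations"] = level
        elif strategy == "DDP":
            placement["weight_gradients"] = level
    return placement


def comm_time_s(placement, act_bytes, grad_bytes, intra_bw, inter_bw):
    """Serialized communication time for the placed AllReduce payloads."""
    bandwidth = {"intra-node": intra_bw, "inter-node": inter_bw}
    size = {"activations": act_bytes, "weight_gradients": grad_bytes}
    return sum(size[payload] / bandwidth[level] for payload, level in placement.items())


# Hypothetical sizes: activations dominate, as with long context lengths.
ACT_BYTES, GRAD_BYTES = 4e9, 1e9
NVLINK_BW, ROCE_BW = 300e9, 25e9  # bytes/s, assumed values

tp_then_ddp = comm_time_s(allreduce_placement("TP", "DDP"),
                          ACT_BYTES, GRAD_BYTES, NVLINK_BW, ROCE_BW)
ddp_then_tp = comm_time_s(allreduce_placement("DDP", "TP"),
                          ACT_BYTES, GRAD_BYTES, NVLINK_BW, ROCE_BW)
```

When activations are the larger payload, keeping TP inside the node sends them over the faster link, so `tp_then_ddp` comes out smaller than `ddp_then_tp` in this example.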

Insight 4: [DLRM Variants] DLRM Transformer and 
MoE variants introduce new compute and communication 
requirements, leading to new parallelization strategy 
choices and task-level implications (Figures 10, 11). 

Figure 10 shows how the same set of parallelization strate- 
gies interacts with both DLRM-A and its variants. For 
DLRM-A Transformer, we apply ( (TP), (DDP) ) on the base 
dense layers since that is the optimal strategy for DLRM-A 
and focus parallelization strategy exploration on transformer 
layers. Across the variants, optimal strategy (yellow star) 
varies. These differences can be attributed to how trans- 
formers introduce more compute and more opportunities for 
communication-computation overlap while MoE increases 
blocking, non-overlapping All2All communication. As mod- 
els continue to evolve, parallelization strategies will as well. 

Figure 11 shows the parallelization strategy design space 
for DLRM-A and its model architecture variants by per- 
device memory capacity requirement and achievable pre- 
training/inference throughput. We denote the performance- 
pareto curve with solid lines, showing how an increase in 
memory capacity can lead to parallelization strategies with 
higher throughputs. We also observe that for pre-training, 
transformer and MoE variants have lower throughput from the 
additional computation and communication, respectively. For 
inference, MoE variants have higher throughput than transformer 
variants since the newly introduced communication 
calls are asymmetrically distributed in the backward pass. 

Figure 13: For DLRM-A pre-training, both overall GPU improvement 
(A100+) and specifically upgrading inter-node interconnect 
fabric lead to observable performance benefits. 

Insight 5: [Context-Length] Increasing context-lengths 
limits the improvements from parallelization strategy opti- 
mizations, necessitating either changes in model architec- 
ture or underlying distributed systems (Figure 12). 

Figure 12 shows that input complexity, in terms of context 
length, plays a key role in training throughput. We investi- 
gate the effectiveness of ( (DDP) ) and ( (TP), (DDP) ) across 
LLMs of increasing context lengths. 2K and 4K context length 
examples refer to LLaMA and LLaMA2 while the 8K con- 
text length data point comes from doubling base LLaMA2’s 
context length while keeping model architecture constant. 


We see that throughput gains from tuning parallelization 
strategy decrease with increasing context length, indicating 
the limits of optimizing this design space. To further improve 
throughput performance, changes have to be made to either 
the underlying distributed system or ML model architecture. 

Insight 6: [GPU-Generations] Across generations of 
GPUs, improvements in compute, memory, and intercon- 
nect not only improve distributed ML performance but 
also unlock different viable parallelization strategies. 

In Figure 13, we compare the A100 against a GPU with 
H100’s specifications (denoted as “A100+”). We also con- 
sider the H100 SuperPOD configuration, where the RoCE/IB 
inter-node interconnect fabric is replaced by NVLink (i.e., 
“A100+ (Inter+)”), leading to ~4.5x inter-node interconnect 
bandwidth compared to H100 DGX systems. 

Compared to the A100 baselines (blue), using A100+ (or- 
ange) leads to varying degrees of speedup for the different 
parallelization strategies. The exact speedup numbers differ 
due to the fact that compute, memory, and networking im- 
prove at different rates when we replace A100s with A100+s 
and different strategies emphasize different system resources. 
For DLRM-A training, improving inter-node bandwidth (i.e., 
A100+ to A100+ (Inter+)) by itself leads to significant through- 
put improvement of 1.82x since the blocking All2All embed- 
ding communication calls are directly accelerated. 

Insight 7: [Future Technology Trends] For large ML 
workloads, improving individual hardware components 
leads to limited throughput gain. Unlocking further performance 
requires jointly improving hardware and systems 
specifications (Figures 14, 15). 

Figure 14: Individually scaling different hardware capabilities for 
(a) DLRM-A and (b) GPT-3 workloads leads to sub-linear speedup. 
Concurrently improving all capabilities leads to super-linear speedup. 

From A100 to A100+, compute, memory capacity, memory 
bandwidth, intra-node interconnect bandwidth, and inter-node 
interconnect bandwidth improve by 2.42x, 2x, 1.29x, 1.5x, and 
2x (9x for SuperPOD), respectively. In Figure 14, we perform 
a hardware scaling study where compute, memory capacity 
and bandwidth, and intra- and inter-node interconnect bandwidth 
are all improved by 10x, separately and concurrently. 
We observe the effects of these improvements on DLRM-A 
and GPT-3 training and inference. 

For DLRM-A pre-training and inference, independently 
improving anything but inter-node interconnect by 10x will 
only net 1.64 and 2.12x throughput improvements, respec- 
tively. For these use-cases, since blocking All2All embedding 
communication is performance-critical, targeting inter-node 
communication bandwidth leads to substantial performance 
improvement. For GPT-3, since compute-bound layers are crit- 
ical to overall throughput, improving just compute throughput 
leads to more workload acceleration compared to DLRMs. 
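
The sub-linear part of this observation follows from an Amdahl-style argument, which the Python sketch below illustrates. It assumes a purely serialized iteration made of independent segments with hypothetical durations; the segment names and numbers are not from the paper. Because it models no overlap and no change of parallelization strategy, it tops out at linear speedup and cannot reproduce the super-linear effect reported for joint scaling.

```python
def iteration_time(segment_times, scale):
    """Serialized iteration time after scaling selected segments.

    `scale` maps a segment name to its hardware speedup factor; segments
    that are not listed keep their original duration.
    """
    return sum(t / scale.get(name, 1.0) for name, t in segment_times.items())


def speedup(segment_times, scale):
    """Throughput speedup relative to the unscaled baseline."""
    return iteration_time(segment_times, {}) / iteration_time(segment_times, scale)


# Hypothetical per-iteration breakdown (ms) for a communication-heavy workload.
baseline = {"compute": 3.0, "memory": 1.0, "intra_comm": 1.0, "inter_comm": 5.0}

only_inter = speedup(baseline, {"inter_comm": 10.0})
only_compute = speedup(baseline, {"compute": 10.0})
everything = speedup(baseline, {name: 10.0 for name in baseline})
```

In this toy breakdown a 10x faster inter-node fabric gives less than 2x overall, and 10x faster compute gives even less, while scaling every segment together recovers the full 10x.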

Figure 15 details the sources of the performance changes. 
Serialized execution breakdown shows execution time al- 
located to embedding lookups, GEMM, and specific com- 
munication collectives, disregarding the effects of overlap. 
Computation-communication overlap breakdown shows how 
much communication is hidden behind embedding lookups 
and GEMM. These breakdowns help us better understand 
the speedup results from Figure 14 since throughput improvements 
can come from a variety of sources: accelerating 
compute-heavy layers (e.g., compute in GPT-3), reducing 
overall communication time (e.g., All2All in recommendation 
models), or even unlocking new parallelization strategies with 
more memory capacity (e.g., DDP for GPT-3). 

Figure 15: (a, c) Serialized execution and (b, d) communication breakdown for both DLRM-A and GPT-3 training allow us to better 
understand where speedups from hardware components come from. 

For all four cases, jointly improving the hardware components 
leads to super-linear performance improvement. This 
is because distributed ML execution is non-serial, so improving 
the performance of each trace segment can lead 
to more overlap or unlock new parallelization strategies altogether. 


7. Related Work and Discussion 


We discuss related work in two primary categories: paral- 
lelization strategy exploration and distributed AI performance 
modeling (Table 4). In addition, we share opportunities for im- 
proving our proposed performance model and its implications 
on efficiently scaling large ML models. 

Parallelization Strategy Exploration. [32, 68] provide 
compiler annotations for identifying efficient parallelization 
strategies. [34, 53] focus on optimizing communication collec- 
tives via fusion and scheduling. [75] focuses on operator-level 
parallelism. [4, 24] focus on parallelization strategy explo- 
ration but are evaluated on older and smaller ML models in 
Computer Vision and NLP. [71] explores strategies to overlap 
compute and communication before PyTorch. In this paper, 
we aim to detach parallelization strategy exploration from 
existing software implementation details to enable an agile 
design space exploration of potentially yet to be implemented 
models. Additionally, we target the latest trillion-parameter-scale 
models and expand our design space beyond just collectives. 

Distributed AI Performance Modeling. [49] provides 
an analytical model for transformer inference on TPUs. [46] 
projects computation-communication overlap opportunities 
for future GPU-centric hardware. [52, 65] provide a simulator 
for estimating distributed ML performance that is validated 
against AllReduce collectives. [29] builds upon [52, 65] to 
introduce a design space exploration tool, yet doesn’t focus 
on optimizing training throughput for specific use cases like 
DLRM models. These works build upon earlier work in simu- 
lating [54, 36] and characterizing [22, 23] distributed systems. 
[63] emphasizes network optimization. [33] focuses on generating 
replayable traces to better estimate hardware resource 
utilization. [57] is an effort to standardize traces across differ- 
ent software implementations for fair comparisons and gen- 
erating synthetic traces, which can potentially be integrated 
with our performance model for better integration with current 
software implementations. We design our performance model 
to be compatible with different hardware platforms, tasks, and 
exploration objectives. We also focus on large ML model 
execution behavior and validate accordingly. 


Even though our proposed performance model success- 
fully navigates the parallelization strategy co-design space 
(Section 6), we foresee extensions in modeling memory re- 
quirements more accurately and integrating more real-world 
production characteristics. 

Memory Estimation. An accurate model of peak memory 
consumption is critical for identifying feasible parallelization 
strategies. However, estimating memory consumption can 
be tricky, as operators such as convolution might allocate 
temporary buffers internally that will cause a temporary rise in 
active memory. Other considerations include modeling which 
temporaries are saved for reverse-mode differentiation and 
layer-specific activation checkpointing. 

Beyond Per-Iteration Execution. Beyond first-order mod- 
eling of per-iteration behavior, we have to consider second- 
order effects such as datacenter inter-job interference, network 
queuing delays, and job rescheduling [2]. At the hardware- 
device level, to accurately model new devices and accelerators 
(e.g., H100, TPU), we have to also take into account microar- 
chitecture, utilization, and software optimization differences. 


Environmentally Sustainable Model Development and 
Deployment. Finally, enabling higher throughput training 
and inference for fixed infrastructure capacity directly im- 
proves the cost-effectiveness for AI datacenters. In the short 
term, increasing system throughput decreases operational and 
embodied carbon footprint [67]. In the long term, through- 
put/efficiency optimizations can increase adoption of large ML 
models, potentially leading to larger overall carbon footprint 
due to the rebound effect [66]. For sustainable AI model development, 
we must keep in mind optimizing ML for efficient 
development and deployment, using renewable energy in operation 
[47], and further reducing negative impacts of system 
hardware manufacturing on the environment [13, 15]. 

Table 4: Related work for distributed AI performance modeling. 
[Table body not recoverable from extraction. Rows compare Pope et al. [49], [46], ASTRA-Sim [52], ASTRA-Sim 2.0 [65], Mystique [33], Chakra [57], and MAD-Max.] 


8. Conclusion 


We present an agile performance modeling framework to 
enable large ML model acceleration across the key model 
development phases: pre-training, fine-tuning, and inference. 
The framework is also validated against large-scale infras- 
tructures used for state-of-the-art ML tasks. Using the suite 
of real-world large ML models on GPU training hardware, 
we demonstrate 2.24x and 5.27x throughput improvement 
potential for pre-training and inference, respectively. 